BioSequence2Vec: Efficient Embedding Generation for Biological Sequences
Authors
Abstract
Representation learning is an important step in the machine learning pipeline. Given the current volume of biological sequencing data, learning an explicit representation is prohibitive due to the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven efficient and useful alternative for several machine learning (ML) tasks such as sequence classification. Three challenges with kernel methods are (i) the computation time, (ii) the memory usage (storing an $$n\times n$$ matrix), and (iii) the usage of kernel matrices being limited to kernel-based ML methods (difficult to generalize on non-kernel classifiers). While (i) can be solved using approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although (iii) can be solved with non-kernel-based methods applied by extracting principal components (kernel PCA), it may result in information loss, while being computationally expensive. In this paper, we propose a general-purpose representation learning approach that embodies kernel methods’ qualities while avoiding the computation, memory, and generalizability challenges. This involves computing a low-dimensional embedding of each sequence, using random projections of its k-mer frequency vectors, significantly reducing the computation needed to compute the dot product and the memory needed to store the resulting representation. Our proposed fast and alignment-free embedding method can be used as input to any distance-based (e.g., k nearest neighbors) and non-distance-based (e.g., decision tree) ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, e.g., SARS-CoV-2 lineage and gene family classification, outperforming state-of-the-art methods in predictive performance.
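The idea of projecting a k-mer frequency vector down to a short embedding can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a plain random sign projection whose +1/-1 entries are derived from a SHA-256 hash, and the names `kmer_counts`, `sign`, `embed`, and `dot`, together with the defaults `k=3` and `dim=64`, are hypothetical choices made for illustration only.

```python
import hashlib
from collections import Counter
from math import sqrt
from typing import List


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in seq (a sparse k-mer frequency vector)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def sign(kmer: str, j: int, seed: int = 0) -> int:
    """Pseudo-random +1/-1 entry of an implicit projection matrix (row j, column kmer).

    Assumption: a hash stands in for a stored random matrix, so no matrix
    of size dim x (number of possible k-mers) is ever materialised.
    """
    digest = hashlib.sha256(f"{seed}:{j}:{kmer}".encode()).digest()
    return 1 if digest[0] & 1 else -1


def embed(seq: str, k: int = 3, dim: int = 64, seed: int = 0) -> List[float]:
    """Project the k-mer count vector of seq onto dim random sign directions."""
    counts = kmer_counts(seq, k)
    scale = 1.0 / sqrt(dim)  # keeps dot products on the scale of the original counts
    return [
        scale * sum(c * sign(kmer, j, seed) for kmer, c in counts.items())
        for j in range(dim)
    ]


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    s1, s2 = "ACGTACGTGACG", "ACGTACGTTACG"
    approx = dot(embed(s1), embed(s2))
    # Exact dot product of the two k-mer count vectors, for comparison.
    c1, c2 = kmer_counts(s1), kmer_counts(s2)
    exact = sum(c * c2[m] for m, c in c1.items())
    print(f"approximate: {approx:.2f}, exact: {exact}")
```

Under these assumptions each sequence is stored as `dim` floats regardless of how many distinct k-mers exist, and the resulting vectors can be passed to any classifier that accepts fixed-length feature vectors. The approximation of the exact k-mer dot product is noisy for small `dim` and is expected to tighten as `dim` grows.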
Similar resources
Efficient Algorithm for Extracting Complete Repeats from Biological Sequences
In this paper, an approach for efficiently extracting the repeating patterns in a biological sequence is proposed. A repeating pattern is a subsequence which appears more than once in a sequence, which is one of the most important features that can be used for revealing functional or evolutionary relationships in biological sequences. The algorithm does a rapid scan of the string to find repeat...
Efficient Matching of Biological Sequences Allowing for Non-overlapping Inversions
Inversions are a class of chromosomal mutations, widely regarded as one of the major mechanisms for reorganizing the genome. In this paper we present a new algorithm for the approximate string matching problem allowing for non-overlapping inversions which runs in O(nm) worst-case time and O(m)-space, for a character sequence of size n and pattern of size m. This improves upon a previous O(nm)-t...
SEED: efficient clustering of next-generation sequences
MOTIVATION Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. RESULTS Here, we introduce SEED-an efficien...
Efficient Coroutine Generation of Constrained Gray Sequences
We study an interesting family of cooperating coroutines, which is able to generate all patterns of bits that satisfy certain fairly general ordering constraints, changing only one bit at a time. (More precisely, the directed graph of constraints is required to be cycle-free when it is regarded as an undirected graph.) If the coroutines are implemented carefully, they yield an algorithm that ne...
Efficient Generation of k-Directional Assembly Sequences
Let S be a collection of n rigid bodies in 3-space, and let D be a set of k directions in 3-space, where k is a small constant. A k-directional assembly sequence for S, with respect to D, is a linear ordering $$\langle s_1, \ldots, s_n \rangle$$ of the bodies in S, such that each $$s_i$$ can be moved to infinity by translating it in one of the directions of D and without intersecting any $$s_j$$, for $$j > i$$. We present an alg...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2023
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-031-33377-4_14